Introduction to Open Data Science

Author: Himanshu Chheda

This course is designed to teach open data science, where "open" means publishing the code, which ultimately makes the analysis reproducible. Data science means crunching numbers to analyze and identify inherent patterns in the data. A link to my coding work is below:

Github: himanshuchheda


Performing and Interpreting Regression

This week's work has been mainly related to data wrangling and running small regression analyses.

  lrn14 <- read.table("~/Documents/GitHub/IODS-project/data/learning2014.txt", header = T, sep = "\t")
  dim(lrn14)
## [1] 166   7
  str(lrn14)
## 'data.frame':    166 obs. of  7 variables:
##  $ gender  : Factor w/ 2 levels "F","M": 1 2 1 2 2 1 2 1 2 1 ...
##  $ Age     : int  53 55 49 53 49 38 50 37 37 42 ...
##  $ attitude: num  3.7 3.1 2.5 3.5 3.7 3.8 3.5 2.9 3.8 2.1 ...
##  $ deep    : num  3.58 2.92 3.5 3.5 3.67 ...
##  $ stra    : num  3.38 2.75 3.62 3.12 3.62 ...
##  $ surf    : num  2.58 3.17 2.25 2.25 2.83 ...
##  $ Points  : int  25 12 24 10 22 21 21 31 24 26 ...
library(GGally) # Loading library for plotting
library(ggplot2) # Loading library for plotting
p <- ggpairs(lrn14, mapping = aes(col = gender, alpha = 0.3), lower = list(combo = wrap("facethist", bins = 20)))
p

my_model <- lm(Points ~ attitude + stra + surf, data = lrn14)
summary(my_model)
## 
## Call:
## lm(formula = Points ~ attitude + stra + surf, data = lrn14)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.1550  -3.4346   0.5156   3.6401  10.8952 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  11.0171     3.6837   2.991  0.00322 ** 
## attitude      3.3952     0.5741   5.913 1.93e-08 ***
## stra          0.8531     0.5416   1.575  0.11716    
## surf         -0.5861     0.8014  -0.731  0.46563    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.296 on 162 degrees of freedom
## Multiple R-squared:  0.2074, Adjusted R-squared:  0.1927 
## F-statistic: 14.13 on 3 and 162 DF,  p-value: 3.156e-08
Since stra and surf are not statistically significant (p > 0.05), we drop them and refit the model with attitude alone.

final_model <- lm(Points ~ attitude, data = lrn14)
summary(final_model)
## 
## Call:
## lm(formula = Points ~ attitude, data = lrn14)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -16.9763  -3.2119   0.4339   4.1534  10.6645 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  11.6372     1.8303   6.358 1.95e-09 ***
## attitude      3.5255     0.5674   6.214 4.12e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared:  0.1906, Adjusted R-squared:  0.1856 
## F-statistic: 38.61 on 1 and 164 DF,  p-value: 4.119e-09
par(mfrow = c(2,2))
plot(my_model, which = 1)
plot(my_model, which = 2)
plot(my_model, which = 5)


Logistic Regression

This week’s work has been mainly related to performing logistic regression in R.

Reading in the file & loading the required libraries

alc <- read.table("~/Documents/GitHub/IODS-project/data/alcohol_consumption.txt", header = T, sep = "\t")
library(tidyr)
## Warning: package 'tidyr' was built under R version 3.4.2
library(dplyr)
## Warning: package 'dplyr' was built under R version 3.4.2
## 
## Attaching package: 'dplyr'
## The following object is masked from 'package:GGally':
## 
##     nasa
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(ggplot2)
library(gridExtra)
## 
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
## 
##     combine

Learning more about the data

colnames(alc)
##  [1] "school"     "sex"        "age"        "address"    "famsize"   
##  [6] "Pstatus"    "Medu"       "Fedu"       "Mjob"       "Fjob"      
## [11] "reason"     "nursery"    "internet"   "guardian"   "traveltime"
## [16] "studytime"  "failures"   "schoolsup"  "famsup"     "paid"      
## [21] "activities" "higher"     "romantic"   "famrel"     "freetime"  
## [26] "goout"      "Dalc"       "Walc"       "health"     "absences"  
## [31] "G1"         "G2"         "G3"         "alc_use"    "high_use"
dim(alc)
## [1] 382  35

As can be seen above, the file contains information about 382 students, with 35 variables each. The data concerns alcohol consumption and secondary-school performance in two Portuguese schools, and includes demographic information, student grades, and social and school-related features. We will perform a logistic regression comparing students who drink high amounts of alcohol with students who drink low amounts.

Take a guess: my hypothesis is that free time, absences, grades, and sex are each associated with high alcohol consumption.

Exploring the data!

  1. Using bar plots
gather(alc) %>% ggplot(aes(value)) + facet_wrap("key", scales = "free") + geom_bar()
## Warning: attributes are not identical across measure variables;
## they will be dropped

  2. Using cross-tabulations
alc %>% group_by(high_use, sex) %>% summarise(count = n(), mean_grade = mean(G3))
## # A tibble: 4 x 4
## # Groups:   high_use [?]
##   high_use    sex count mean_grade
##      <lgl> <fctr> <int>      <dbl>
## 1    FALSE      F   156   11.39744
## 2    FALSE      M   112   12.20536
## 3     TRUE      F    42   11.71429
## 4     TRUE      M    72   10.27778
  3. Using box plots
g1 <- ggplot(alc, aes(x = high_use, y = absences, col = sex)) + geom_boxplot() + ylab("Absences") + ggtitle("Absences")
g2 <- ggplot(alc, aes(x = high_use, y = G3, col = sex)) + geom_boxplot() + ylab("Grades") + ggtitle("Grades")
g3 <- ggplot(alc, aes(x = high_use, y = freetime, col = sex)) + geom_boxplot() + ylab("Freetime") + ggtitle("Freetime")
grid.arrange(g3, g1, g2, ncol = 2)

Based on the plots above, the differences between high and low alcohol users appear much more pronounced in males than in females.

Performing logistic regression

mod1 <- glm(high_use ~ freetime + absences + G3 + sex, data = alc, family = "binomial")
summary(mod1)
## 
## Call:
## glm(formula = high_use ~ freetime + absences + G3 + sex, family = "binomial", 
##     data = alc)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.0212  -0.8299  -0.6063   1.0926   2.1162  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.87119    0.61712  -3.032 0.002428 ** 
## freetime     0.28229    0.12544   2.250 0.024423 *  
## absences     0.09394    0.02293   4.097 4.18e-05 ***
## G3          -0.07348    0.03610  -2.036 0.041778 *  
## sexM         0.88194    0.24645   3.579 0.000346 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 465.68  on 381  degrees of freedom
## Residual deviance: 420.33  on 377  degrees of freedom
## AIC: 430.33
## 
## Number of Fisher Scoring iterations: 4
OR <- exp(coef(mod1))
CI <- exp(confint(mod1))
## Waiting for profiling to be done...
cbind(OR, CI)
##                    OR      2.5 %    97.5 %
## (Intercept) 0.1539399 0.04471733 0.5058444
## freetime    1.3261680 1.03978990 1.7022310
## absences    1.0984987 1.05237029 1.1515837
## G3          0.9291511 0.86511705 0.9970999
## sexM        2.4155823 1.49749137 3.9428978

As can be seen above, all the candidate variables appear to be associated with high alcohol use. The strongest association is with sex: the odds of high consumption for a male student are 2.42 times those of a female student. Free time and absences also have odds ratios above 1, indicating a positive association with high alcohol consumption. Grades, as expected, are negatively associated with high alcohol consumption.
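The odds ratios in the table are simply the exponentiated model coefficients. A minimal sketch of the conversion, using the sexM estimate reported in the summary above:

```r
# Logistic regression coefficients are on the log-odds scale;
# exponentiating a coefficient turns it into an odds ratio.
beta_sexM <- 0.88194        # sexM estimate from summary(mod1) above
or_sexM <- exp(beta_sexM)
round(or_sexM, 2)           # ~2.42, matching the OR column above
```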

Testing the predictive power of the model based on these 4 variables

probabilities <- predict(mod1, type = "response")
alc <- mutate(alc, probability = probabilities)
alc <- mutate(alc, prediction = probability > 0.5)
table(high_use = alc$high_use, prediction = alc$prediction) %>% prop.table()
##         prediction
## high_use      FALSE       TRUE
##    FALSE 0.66230366 0.03926702
##    TRUE  0.22513089 0.07329843
g <- ggplot(alc, aes(x = probability, y = high_use, col = prediction))
g + geom_point()

As the table and the plot show, this model isn’t optimal: the total misclassification rate is about 26%. Furthermore, false negatives greatly outnumber false positives.
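The ~26% figure can be read straight off the proportion table: it is the sum of the two off-diagonal cells. A small sketch using the values printed above:

```r
# Training error = share of misclassified students
# (off-diagonal cells of the proportion table above).
false_pos <- 0.03926702   # high_use FALSE, predicted TRUE
false_neg <- 0.22513089   # high_use TRUE, predicted FALSE
round(false_pos + false_neg, 2)   # 0.26
```

With the full data in hand, the same number is simply mean(alc$high_use != alc$prediction).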


Clustering and Classification

This week’s work is related to classifying and clustering data.

Loading the dataset and the required libraries

library(MASS)
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select
library(corrplot)
## Warning: package 'corrplot' was built under R version 3.4.2
## corrplot 0.84 loaded
library(tidyverse)
## Warning: package 'tidyverse' was built under R version 3.4.2
## ── Attaching packages ────────────────────────────────── tidyverse 1.2.1 ──
## ✔ tibble  1.3.4     ✔ purrr   0.2.4
## ✔ readr   1.1.1     ✔ stringr 1.2.0
## ✔ tibble  1.3.4     ✔ forcats 0.2.0
## Warning: package 'purrr' was built under R version 3.4.2
## ── Conflicts ───────────────────────────────────── tidyverse_conflicts() ──
## ✖ gridExtra::combine() masks dplyr::combine()
## ✖ dplyr::filter()      masks stats::filter()
## ✖ dplyr::lag()         masks stats::lag()
## ✖ MASS::select()       masks dplyr::select()
library(dplyr)
library(ggplot2)
data("Boston")

Exploring the structure and dimension of the dataset

str(Boston)
## 'data.frame':    506 obs. of  14 variables:
##  $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ rm     : num  6.58 6.42 7.18 7 7.15 ...
##  $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
##  $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ black  : num  397 397 393 395 397 ...
##  $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
dim(Boston)
## [1] 506  14

The Boston dataset contains information about housing values in the suburbs of Boston, Massachusetts. It is a relatively small dataset, with only 506 observations of 14 variables. The variables include the per-capita crime rate, the proportion of residential land zoned for large lots, nitric oxide concentration, the average number of rooms per dwelling, and so on.

Graphical overview of the dataset

#pairs(Boston)
gather(Boston) %>% ggplot(aes(value)) + facet_wrap("key", scales = "free") + geom_bar()

cor_matrix<-cor(Boston)
corrplot(cor_matrix %>% round(2), method="circle", type = "upper", cl.pos = "b", tl.pos = "d", tl.cex = 0.6)

As can be seen above, the variables follow different kinds of distributions. We have also calculated the correlations between the variables. The strongest correlation is between “rad” and “tax”, where rad stands for accessibility to radial highways. There is also a strong negative correlation between “age” and “dis”, i.e. between the proportion of owner-occupied units built prior to 1940 and the weighted distance to five Boston employment centres.
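The strongest pair can also be found programmatically instead of by eye. A small sketch (recomputing the correlation matrix so the snippet is self-contained):

```r
library(MASS)                            # Boston ships with MASS
data("Boston")
cm <- cor(Boston)
cm[lower.tri(cm, diag = TRUE)] <- NA     # keep each variable pair once
idx <- which(abs(cm) == max(abs(cm), na.rm = TRUE), arr.ind = TRUE)
rownames(cm)[idx[1]]                     # "rad"
colnames(cm)[idx[2]]                     # "tax"
cm[idx]                                  # ~0.91
```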

Standardizing the dataset

boston_scaled <- scale(Boston)
summary(boston_scaled)
##       crim                 zn               indus        
##  Min.   :-0.419367   Min.   :-0.48724   Min.   :-1.5563  
##  1st Qu.:-0.410563   1st Qu.:-0.48724   1st Qu.:-0.8668  
##  Median :-0.390280   Median :-0.48724   Median :-0.2109  
##  Mean   : 0.000000   Mean   : 0.00000   Mean   : 0.0000  
##  3rd Qu.: 0.007389   3rd Qu.: 0.04872   3rd Qu.: 1.0150  
##  Max.   : 9.924110   Max.   : 3.80047   Max.   : 2.4202  
##       chas              nox                rm               age         
##  Min.   :-0.2723   Min.   :-1.4644   Min.   :-3.8764   Min.   :-2.3331  
##  1st Qu.:-0.2723   1st Qu.:-0.9121   1st Qu.:-0.5681   1st Qu.:-0.8366  
##  Median :-0.2723   Median :-0.1441   Median :-0.1084   Median : 0.3171  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.:-0.2723   3rd Qu.: 0.5981   3rd Qu.: 0.4823   3rd Qu.: 0.9059  
##  Max.   : 3.6648   Max.   : 2.7296   Max.   : 3.5515   Max.   : 1.1164  
##       dis               rad               tax             ptratio       
##  Min.   :-1.2658   Min.   :-0.9819   Min.   :-1.3127   Min.   :-2.7047  
##  1st Qu.:-0.8049   1st Qu.:-0.6373   1st Qu.:-0.7668   1st Qu.:-0.4876  
##  Median :-0.2790   Median :-0.5225   Median :-0.4642   Median : 0.2746  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.6617   3rd Qu.: 1.6596   3rd Qu.: 1.5294   3rd Qu.: 0.8058  
##  Max.   : 3.9566   Max.   : 1.6596   Max.   : 1.7964   Max.   : 1.6372  
##      black             lstat              medv        
##  Min.   :-3.9033   Min.   :-1.5296   Min.   :-1.9063  
##  1st Qu.: 0.2049   1st Qu.:-0.7986   1st Qu.:-0.5989  
##  Median : 0.3808   Median :-0.1811   Median :-0.1449  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.4332   3rd Qu.: 0.6024   3rd Qu.: 0.2683  
##  Max.   : 0.4406   Max.   : 3.5453   Max.   : 2.9865
boston_scaled <- as.data.frame(boston_scaled)
bins <- quantile(boston_scaled$crim)
crime <- cut(boston_scaled$crim, breaks = bins, include.lowest = TRUE, labels = c("low", "med_low", "med_high", "high"))
boston_scaled <- dplyr::select(boston_scaled, -crim)
boston_scaled <- data.frame(boston_scaled, crime)
ind <- sample(nrow(boston_scaled),  size = nrow(boston_scaled) * 0.8)
train <- boston_scaled[ind,]
test <- boston_scaled[-ind,]

Scaling normalizes the continuous variables: for each variable, we subtract the mean from every value and divide by the standard deviation. We have also created a categorical variable, crime, using the quantiles of the scaled crim variable as break points, and split the dataset into a training set (80%) and a test set (20%).
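The centering-and-rescaling step can be verified by hand on a toy vector; scale() does exactly (x - mean(x)) / sd(x), column by column:

```r
# scale() subtracts the column mean and divides by the column sd.
x <- c(2, 4, 6, 8)
manual <- (x - mean(x)) / sd(x)
all.equal(as.vector(scale(x)), manual)   # TRUE
```

Note that sample() is random, so without a set.seed() call the 80/20 split above is not reproducible between runs.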

Creating Linear Discriminant Analysis model

lda.fit <- lda(crime ~ ., data = train)
lda.fit
## Call:
## lda(crime ~ ., data = train)
## 
## Prior probabilities of groups:
##       low   med_low  med_high      high 
## 0.2524752 0.2500000 0.2450495 0.2524752 
## 
## Group means:
##                   zn      indus          chas        nox         rm
## low       0.94157733 -0.8854392 -0.0793339581 -0.8595014  0.4207280
## med_low  -0.05953014 -0.2754450  0.0005392655 -0.5603640 -0.1678177
## med_high -0.39975350  0.1661412  0.1651265141  0.4104981  0.1064878
## high     -0.48724019  1.0171096 -0.0407349362  1.0657879 -0.4146501
##                 age        dis        rad        tax     ptratio
## low      -0.8725460  0.8952815 -0.7015101 -0.7252251 -0.41736515
## med_low  -0.2962204  0.3810652 -0.5440892 -0.4542263 -0.07138447
## med_high  0.3974486 -0.3905543 -0.4227186 -0.3256471 -0.32565648
## high      0.7950240 -0.8461146  1.6382099  1.5141140  0.78087177
##                black        lstat        medv
## low       0.37813681 -0.773093789  0.47804237
## med_low   0.34637117 -0.091645952 -0.03187986
## med_high  0.08284963 -0.002593142  0.16798456
## high     -0.73281093  0.905063375 -0.76158483
## 
## Coefficients of linear discriminants:
##                  LD1         LD2         LD3
## zn       0.115848123  0.69800600 -0.93669491
## indus   -0.003834602 -0.12495330  0.39206741
## chas    -0.100687681 -0.01246109  0.10257318
## nox      0.363121739 -0.79262489 -1.37812732
## rm      -0.106355361 -0.01552473 -0.18162807
## age      0.225537306 -0.29299748  0.02186304
## dis     -0.094930837 -0.19880548  0.32858141
## rad      3.377694077  0.85736482 -0.10316839
## tax      0.078274925  0.02066276  0.56895311
## ptratio  0.140676465  0.02712051 -0.34689809
## black   -0.127069432  0.07960893  0.18405788
## lstat    0.263055910 -0.26153421  0.45909712
## medv     0.240816226 -0.55423541 -0.13247206
## 
## Proportion of trace:
##    LD1    LD2    LD3 
## 0.9552 0.0327 0.0120
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
  heads <- coef(x)
  arrows(x0 = 0, y0 = 0, 
         x1 = myscale * heads[,choices[1]], 
         y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
  text(myscale * heads[,choices], labels = row.names(heads), 
       cex = tex, col=color, pos=3)
}
classes <- as.numeric(train$crime)
plot(lda.fit, dimen = 2, col = classes, pch = classes)
lda.arrows(lda.fit, myscale = 1)

Predicting on the test data

correct_classes <- test$crime
test <- dplyr::select(test, -crime)
lda.pred <- predict(lda.fit, newdata = test)
table(correct = correct_classes, predicted = lda.pred$class)
##           predicted
## correct    low med_low med_high high
##   low       16       7        2    0
##   med_low    6      15        4    0
##   med_high   1       8       16    2
##   high       0       0        0   25

Using the model fitted on the training data, we predict the classes for the test data and assess accuracy by comparing the predictions against the true classes. The model performs very well on the “high” class (values in the fourth quartile): all 25 observations are predicted correctly, and there are only 2 false positives. For the other classes the predictions are weaker, especially for “med_low” and “med_high”.
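The overall accuracy can be computed from the confusion matrix. The counts below are copied from the table above; since the train/test split is random, a rerun would give somewhat different numbers:

```r
# Cross-tabulation of correct vs. predicted classes, copied from above.
conf <- matrix(c(16,  7,  2,  0,
                  6, 15,  4,  0,
                  1,  8, 16,  2,
                  0,  0,  0, 25),
               nrow = 4, byrow = TRUE)
accuracy <- sum(diag(conf)) / sum(conf)  # correct predictions / total
round(accuracy, 2)                       # 0.71 for this split
```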

Clustering the data

library(MASS)
data("Boston")
boston_scaled <- scale(Boston)
boston_scaled <- as.data.frame(boston_scaled)
dist_eu <- dist(boston_scaled, method = "euclidean")
summary(dist_eu)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1343  3.4625  4.8241  4.9111  6.1863 14.3970
k_max <- 10
twcss <- sapply(1:k_max, function(k){kmeans(boston_scaled, k)$tot.withinss})
qplot(x = 1:k_max, y = twcss, geom = 'line')

km <- kmeans(boston_scaled, centers = 2)
pairs(boston_scaled, col = km$cluster)

So we tried clustering the Boston dataset, starting without knowing the ideal number of clusters. We first calculated the Euclidean distances between observations, then fit k-means models with 1 to 10 clusters. As the plot shows, the total within-cluster sum of squares drops most sharply between 1 and 2, so 2 seems to be the ideal number of clusters for this dataset. Based on the last plot, some variables and their interactions separate the clusters very well (for example crim), whereas others (for example lstat) do not.
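One caveat: kmeans() starts from random centers, so the elbow curve can change between runs. A sketch that makes it reproducible (the seed value 123 and nstart = 10 are arbitrary choices):

```r
library(MASS)
data("Boston")
boston_scaled <- as.data.frame(scale(Boston))

set.seed(123)  # fix the random initial centers for reproducibility
twcss <- sapply(1:10, function(k) {
  kmeans(boston_scaled, centers = k, nstart = 10)$tot.withinss
})
diff(twcss)    # successive drops in total within-cluster sum of squares
```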


Dimensionality Reduction

This week we will explore dimensionality reduction techniques, including PCA and MCA.

Loading the data and the required libraries for the analysis

library(corrplot)
library(dplyr)
library(FactoMineR)
## Warning: package 'FactoMineR' was built under R version 3.4.2
library(GGally)
library(ggplot2)
library(tidyr)
human <- read.table("~/Documents/GitHub/IODS-project/data/human.txt", header = T, sep = "\t", row.names = 1)
str(human)
## 'data.frame':    155 obs. of  8 variables:
##  $ Edu2.FM  : num  1.007 0.997 0.983 0.989 0.969 ...
##  $ Labo.FM  : num  0.891 0.819 0.825 0.884 0.829 ...
##  $ Edu.Exp  : num  17.5 20.2 15.8 18.7 17.9 16.5 18.6 16.5 15.9 19.2 ...
##  $ Life.Exp : num  81.6 82.4 83 80.2 81.6 80.9 80.9 79.1 82 81.8 ...
##  $ GNI      : int  64992 42261 56431 44025 45435 43919 39568 52947 42155 32689 ...
##  $ Mat.Mor  : int  4 6 6 5 6 7 9 28 11 8 ...
##  $ Ado.Birth: num  7.8 12.1 1.9 5.1 6.2 3.8 8.2 31 14.5 25.3 ...
##  $ Parli.F  : num  39.6 30.5 28.5 38 36.9 36.9 19.9 19.4 28.2 31.4 ...
dim(human)
## [1] 155   8

Graphical overview of the data

ggpairs(human)

cor(human) %>% corrplot(title = "Correlation plot", type = "upper", tl.cex = 0.6)

Performing Principal Component Analysis

pca_human <- prcomp(human)
biplot(pca_human, choices = 1:2, cex = c(0.8,1), col = c("grey40", "deeppink2"))
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped

Performing Principal Component Analysis (Standardized data)

human_std <- scale(human)
pca_human_std <- prcomp(human_std)
summary(pca_human_std)
## Importance of components%s:
##                           PC1    PC2     PC3     PC4     PC5     PC6
## Standard deviation     2.0708 1.1397 0.87505 0.77886 0.66196 0.53631
## Proportion of Variance 0.5361 0.1624 0.09571 0.07583 0.05477 0.03595
## Cumulative Proportion  0.5361 0.6984 0.79413 0.86996 0.92473 0.96069
##                            PC7     PC8
## Standard deviation     0.45900 0.32224
## Proportion of Variance 0.02634 0.01298
## Cumulative Proportion  0.98702 1.00000
biplot(pca_human_std, xlab = "PC1 (53.6%)", ylab = "PC2 (16.2%)",choices = 1:2, cex = c(0.8,1), col = c("grey40", "deeppink2"))
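The percentages hard-coded into the axis labels above come from the Proportion of Variance row of summary(pca_human_std). They can be computed directly from the component standard deviations; a sketch with a hypothetical helper pct_of(), demonstrated on the built-in iris data since human is read from a local file:

```r
# Share of total variance explained by each principal component.
pct_of <- function(pca) {
  v <- pca$sdev^2                 # component variances (eigenvalues)
  round(100 * v / sum(v), 1)
}
pct_of(prcomp(scale(iris[, 1:4])))   # PC1 is ~73% for iris
```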

Multiple Correspondence Analysis

library(FactoMineR)
library(dplyr)
library(ggplot2)
data(tea)
#keep_columns <- c("Tea", "How", "how", "sugar", "where", "lunch")
#tea_time <- select(tea, keep_columns)
tea_time <- data.frame(tea$Tea, tea$How, tea$how, tea$sugar, tea$where, tea$lunch)
summary(tea_time)
##       tea.Tea     tea.How                  tea.how       tea.sugar  
##  black    : 74   alone:195   tea bag           :170   No.sugar:155  
##  Earl Grey:193   lemon: 33   tea bag+unpackaged: 94   sugar   :145  
##  green    : 33   milk : 63   unpackaged        : 36                 
##                  other:  9                                          
##                 tea.where       tea.lunch  
##  chain store         :192   lunch    : 44  
##  chain store+tea shop: 78   Not.lunch:256  
##  tea shop            : 30                  
## 
str(tea_time)
## 'data.frame':    300 obs. of  6 variables:
##  $ tea.Tea  : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
##  $ tea.How  : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
##  $ tea.how  : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ tea.sugar: Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
##  $ tea.where: Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ tea.lunch: Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
dim(tea_time)
## [1] 300   6
gather(tea_time) %>% ggplot(aes(value)) + facet_wrap("key", scales = "free") + geom_bar() + theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 8))
## Warning: attributes are not identical across measure variables;
## they will be dropped

mca <- MCA(tea_time, graph = FALSE)
summary(mca)
## 
## Call:
## MCA(X = tea_time, graph = FALSE) 
## 
## 
## Eigenvalues
##                        Dim.1   Dim.2   Dim.3   Dim.4   Dim.5   Dim.6
## Variance               0.279   0.261   0.219   0.189   0.177   0.156
## % of var.             15.238  14.232  11.964  10.333   9.667   8.519
## Cumulative % of var.  15.238  29.471  41.435  51.768  61.434  69.953
##                        Dim.7   Dim.8   Dim.9  Dim.10  Dim.11
## Variance               0.144   0.141   0.117   0.087   0.062
## % of var.              7.841   7.705   6.392   4.724   3.385
## Cumulative % of var.  77.794  85.500  91.891  96.615 100.000
## 
## Individuals (the 10 first)
##                       Dim.1    ctr   cos2    Dim.2    ctr   cos2    Dim.3
## 1                  | -0.298  0.106  0.086 | -0.328  0.137  0.105 | -0.327
## 2                  | -0.237  0.067  0.036 | -0.136  0.024  0.012 | -0.695
## 3                  | -0.369  0.162  0.231 | -0.300  0.115  0.153 | -0.202
## 4                  | -0.530  0.335  0.460 | -0.318  0.129  0.166 |  0.211
## 5                  | -0.369  0.162  0.231 | -0.300  0.115  0.153 | -0.202
## 6                  | -0.369  0.162  0.231 | -0.300  0.115  0.153 | -0.202
## 7                  | -0.369  0.162  0.231 | -0.300  0.115  0.153 | -0.202
## 8                  | -0.237  0.067  0.036 | -0.136  0.024  0.012 | -0.695
## 9                  |  0.143  0.024  0.012 |  0.871  0.969  0.435 | -0.067
## 10                 |  0.476  0.271  0.140 |  0.687  0.604  0.291 | -0.650
##                       ctr   cos2  
## 1                   0.163  0.104 |
## 2                   0.735  0.314 |
## 3                   0.062  0.069 |
## 4                   0.068  0.073 |
## 5                   0.062  0.069 |
## 6                   0.062  0.069 |
## 7                   0.062  0.069 |
## 8                   0.735  0.314 |
## 9                   0.007  0.003 |
## 10                  0.643  0.261 |
## 
## Categories (the 10 first)
##                        Dim.1     ctr    cos2  v.test     Dim.2     ctr
## black              |   0.473   3.288   0.073   4.677 |   0.094   0.139
## Earl Grey          |  -0.264   2.680   0.126  -6.137 |   0.123   0.626
## green              |   0.486   1.547   0.029   2.952 |  -0.933   6.111
## alone              |  -0.018   0.012   0.001  -0.418 |  -0.262   2.841
## lemon              |   0.669   2.938   0.055   4.068 |   0.531   1.979
## milk               |  -0.337   1.420   0.030  -3.002 |   0.272   0.990
## other              |   0.288   0.148   0.003   0.876 |   1.820   6.347
## tea bag            |  -0.608  12.499   0.483 -12.023 |  -0.351   4.459
## tea bag+unpackaged |   0.350   2.289   0.056   4.088 |   1.024  20.968
## unpackaged         |   1.958  27.432   0.523  12.499 |  -1.015   7.898
##                       cos2  v.test     Dim.3     ctr    cos2  v.test  
## black                0.003   0.929 |  -1.081  21.888   0.382 -10.692 |
## Earl Grey            0.027   2.867 |   0.433   9.160   0.338  10.053 |
## green                0.107  -5.669 |  -0.108   0.098   0.001  -0.659 |
## alone                0.127  -6.164 |  -0.113   0.627   0.024  -2.655 |
## lemon                0.035   3.226 |   1.329  14.771   0.218   8.081 |
## milk                 0.020   2.422 |   0.013   0.003   0.000   0.116 |
## other                0.102   5.534 |  -2.524  14.526   0.197  -7.676 |
## tea bag              0.161  -6.941 |  -0.065   0.183   0.006  -1.287 |
## tea bag+unpackaged   0.478  11.956 |   0.019   0.009   0.000   0.226 |
## unpackaged           0.141  -6.482 |   0.257   0.602   0.009   1.640 |
## 
## Categorical variables (eta2)
##                      Dim.1 Dim.2 Dim.3  
## tea.Tea            | 0.126 0.108 0.410 |
## tea.How            | 0.076 0.190 0.394 |
## tea.how            | 0.708 0.522 0.010 |
## tea.sugar          | 0.065 0.001 0.336 |
## tea.where          | 0.702 0.681 0.055 |
## tea.lunch          | 0.000 0.064 0.111 |
plot(mca, invisible=c("var"))

cats = apply(tea_time, 2, function(x) nlevels(as.factor(x)))
mca_obs_df = data.frame(mca$ind$coord)
mca_vars_df = data.frame(mca$var$coord, Variable = rep(names(cats), cats))
ggplot(data=mca_vars_df, aes(x = Dim.1, y = Dim.2, label = rownames(mca_vars_df))) + geom_hline(yintercept = 0, colour = "black") + geom_vline(xintercept = 0, colour = "black") + geom_text(aes(colour=Variable)) + ggtitle("MCA plot")